(54) Compound word recognition

(57) Recognition of a text string is improved by an- 
alyzing the text string with respect to information about 
expected patterns of the parts of speech of words in the 
text string and by modifying the text string based on the 
analysis. Analyzing may include comparing the combinations of
parts of speech to parts of speech associated with the 
words in the text string and, if at least one of the com- 
binations of parts of speech matches parts of speech 
associated with the words, indicating that a compound 
word should be formed from the words associated with 
the matched parts of speech. 
Description

[0001] The invention relates to computer-implement- 
ed speech recognition. 

[0002] A typical speech recognition system includes 
a recognizer and a stored vocabulary of words which 
the recognizer is capable of recognizing. The recognizer 
receives information about utterances by a speaker and 
delivers a corresponding recognized word or string of 
recognized words drawn from the vocabulary. The 
stored vocabulary often includes additional information 
for each of the vocabulary words, such as the word's 
part of speech (e.g., noun, verb, adverb). 
[0003] In German, consecutive words in a sentence 
are frequently concatenated to form compound words. 
For example, referring to FIG. 1a, in the string of spoken
words "er hört daß der Präsident Wahl Kampf Geschichten
geschrieben hat" 8 (which, translated into English,
is "he hears that the president has written election
campaign stories"), the words "Wahl," "Kampf," and
"Geschichten" would be combined to form the compound
word "Wahlkampfgeschichten."
[0004] Some German speech recognition systems 
place frequently used compound words in the stored vo- 
cabulary to enable them to recognize those words using 
standard recognition techniques. Other German speech 
recognition systems are trained with text containing 
compound words. During training, such systems identify 
compound words in the text and also identify the
constituent words which make up the compound words.
During recognition of German speech, such systems 
form compound words by concatenating words which 
were previously identified as making up compound 
words in the training text. 

[0005] In one aspect, a computer is used to improve 
recognition of a text string including words in a language 
(e.g., German) having associated parts of speech. The 
text string is analyzed with respect to information about 
expected patterns of the parts of speech in the language 
and modified based on the analysis. The information 
may include rules descriptive of combinations of parts 
of speech in the language corresponding to compound 
words in the language. The combinations of parts of 
speech may be sequences of parts of speech. 
[0006] Analyzing may include comparing the combi- 
nations of parts of speech to parts of speech associated 
with the words in the text string and indicating that a 
compound word should be formed from the words as- 
sociated with the matched parts of speech if at least one 
of the combinations of parts of speech matches parts of 
speech associated with the words. Modifying the text 
string may include forming a compound word from 
words in the text string. The compound word may be 
added to a vocabulary. 

[0007] Modifying the text string may include replacing 
words in the text string with the compound word. The 
modified text string may be added to a list of candidate 
text strings. The text string may be analyzed with respect
to rules descriptive of other unpreferred combinations
of parts of speech in the language corresponding
to combinations of words which do not typically form
compound words in the language, and it may be indicated
that a compound word should not be formed from the
words associated with the matched parts of speech if at
least one of the unpreferred combinations of parts of
speech matches parts of speech associated with the
words. The unpreferred combinations of parts of speech
may correspond to combinations of groups (e.g., pairs)
of parts of speech, with the groups corresponding to
phrases.

[0008] The compound word may be added to a compound
word cache. Adding the compound word may include
increasing the frequency count of the compound
word in the compound word cache. The compound word
also may be added to a vocabulary.
[0009] The text string may be analyzed with respect
to agreement rules descriptive of patterns of agreement
of case, number, and gender of words corresponding to
combinations of words which do not typically form
compound words in the language, and it may be indicated
that a compound word should not be formed from the
matching words if at least one of the agreement rules
matches words in the text string.

[0010] The agreement rules may include a rule indicating
that if a noun in a subordinate clause matches
the case, number, and gender of a preceding determiner,
a compound word should not be formed from the
noun and subsequent words in the subordinate clause.
The agreement rules may include a rule indicating that
if a noun in a non-subordinate clause matches the case,
number, and gender of a preceding determiner, a compound
word should not be formed from words in the
noun phrase containing the noun and words subsequent
to the noun phrase.

[0011] The compound word may be identified as an
incorrect compound word, and the compound word may
be added to a compound word error cache. Adding the
compound word to the compound word error cache may
include increasing a frequency of the compound word
in the compound word error cache. If the compound
word has been identified as an incorrect compound
word, it may be indicated that the compound word
should not be formed from the words associated with
the matched parts of speech. The compound word may
be identified as an incorrect compound word in response
to action of a user by adding the compound word
to a compound word error cache. It may be indicated
that the compound word should not be formed from the
words associated with the matched parts of speech if
the compound word has been identified as an incorrect
compound word more frequently than the compound
word has not been identified to be an incorrect
compound word.

[0012] Among the advantages of the invention are 
one or more of the following. 

[0013] Use of language-specific compounding rules
to recognize compound words allows recognition of
compound words which are not in the stored vocabulary. 
A speech recognition system that is capable of recog- 
nizing compound words may, therefore, use a stored vo- 
cabulary which contains only ordinary (non-compound) 
words, or which contains only a small number of fre- 
quently-used compound words. Reducing the number 
of compound words that are stored in the stored vocab- 
ulary reduces the amount of time and effort needed to 
generate the vocabulary and reduces the total size of 
the vocabulary. The ability to recognize compound 
words not stored in the vocabulary also potentially in- 
creases the total number of recognizable compound 
words. Reduction in vocabulary size may also result in 
increased recognition speed. Furthermore, the space 
that is saved may be used for other purposes, such as 
storing domain-specific vocabularies. 
[0014] Use of compounding rules to recognize com- 
pound words also facilitates modification of the speech 
recognition system's compound word recognition capa- 
bilities. The set of compound words recognized by the 
speech recognition system may be changed by adding, 
deleting, or modifying the compounding rules, rather 
than by modifying the stored vocabulary. This feature 
also facilitates addition of compound word recognition 
capabilities to existing speech recognition systems. 
[0015] The techniques may be implemented in com- 
puter hardware or software, or a combination of the two. 
However, the techniques are not limited to any particular 
hardware or software configuration; they may find applicability
in any computing or processing environment that
may be used for improvement of speech recognition. 
Preferably, the techniques are implemented in computer 
programs executing on programmable computers that 
each include a processor, a storage medium readable 
by the processor (including volatile and non-volatile 
memory and/or storage elements), at least one input de- 
vice, and one or more output devices. Program code is 
applied to data entered using the input device to perform 
the functions described and to generate output informa- 
tion. The output information is applied to the one or more 
output devices. 

[0016] Each program is preferably implemented in a 
high level procedural or object-oriented programming 
language to communicate with a computer system. 
However, the programs can be implemented in assem- 
bly or machine language, if desired. In any case, the lan- 
guage may be a compiled or interpreted language. 
[0017] Each such computer program is preferably 
stored on a storage medium or device (e.g., CD-ROM,
hard disk or magnetic diskette) that is readable by a gen- 
eral or special purpose programmable computer for 
configuring and operating the computer when the stor- 
age medium or device is read by the computer to per- 
form the procedures described in this document. The 
system may also be considered to be implemented as 
a computer-readable storage medium, configured with 
a computer program, where the storage medium so configured
causes a computer to operate in a specific and
predefined manner.

[0018] Other features and advantages of the invention
will become apparent from the following description,
including the drawings, and from the claims.

[0019] An example according to the present invention
will be described with reference to the accompanying
drawings, in which:

[0020] FIG. 1a is a diagram of a sequence of German
words spoken by a user and a sequence of corresponding
recognized words.

[0021] FIG. 1b is a diagram of a category sequence
corresponding to the sequence of recognized words
shown in FIG. 1a.

[0022] FIG. 2 is a block diagram of a computer.

[0023] FIG. 3 is a diagram of a choice list of possible
sentence choices.

[0024] FIG. 4 is a diagram of a sequence of word
identifiers and a vocabulary stored in a computer-readable
memory.

[0025] FIG. 5 is a flow chart of a computer-implemented
method for concatenating words in a sequence of
words into compound words.

[0026] FIG. 6 is a flow chart of a computer-implemented
method for matching syntactic templates against a
category sequence.

[0027] FIG. 7a is a diagram of a sequence of recognized
words, a corresponding category sequence, and
a syntactic template.

[0028] FIG. 7b is a diagram of a sequence of recognized
words, a corresponding category sequence, and
a syntactic template which matches part of the category
sequence.

[0029] FIG. 7c is a diagram of a category sequence
which includes a boundary flag.

[0030] FIG. 8 is a flow chart of a computer-implemented
method for applying agreement rules to a category
sequence.

[0031] FIG. 9 is a flow chart of a method for concatenating
words into compound words.

[0032] FIGS. 10a-10c are diagrams of a sentence
choice in various stages of the compounding process.

[0033] FIG. 11 is a diagram of a choice list with a
sentence choice including a compound word.

[0034] FIG. 12 is a flow chart of a method for adding
compound words to a compound word cache and to a
vocabulary.

[0035] FIG. 13 is a flow chart of a method for correcting
an incorrect compound word.

[0036] FIG. 14 is a flow chart of a method for improving
recognition of compound words.
[0037] Referring to FIG. 2, to correctly recognize compound
words spoken in German or other languages, a
computer 202 includes a compounder process 200
stored in a memory 204. When presented with a sentence
choice 10 (FIG. 1a) corresponding to a string of
German words 8 spoken by a user, the compounder
process 200 identifies the words "Wahl," "Kampf," and
"Geschichten" as words to be concatenated into a compound
word, and then concatenates them into the compound
word "Wahlkampfgeschichten."
[0038] When a user speaks the string of words 8 into
a microphone 206, analog signals representing the user's
speech are sent to the computer 202, converted
from analog into digital form by an analog-to-digital (A/D)
converter 208, and processed by a digital signal processor
(DSP) 210. The processed speech signals are
stored as processed speech 211 in memory 204. A continuous
speech recognizer process 212 uses the processed
speech 211 to identify the start and end of each
spoken sentence, to recognize words in the sentence,
and to produce a choice list 220 of possible sentence
choices 10, 14, and 16 (FIG. 3). A suitable continuous
speech recognizer process is part of NaturallySpeaking™,
available from Dragon Systems, Inc. of West
Newton, Massachusetts. Each of the sentence choices
10, 14, and 16 represents a possible match for the string
of words 8 spoken by the user. The choice list 220 is
stored in memory 204 and is ordered such that the most
likely correct sentence choice 10, as determined by the
recognizer process 212, is at the top of the choice list
220.

[0039] The sentence choices 10, 14, and 16 are 
stored in memory 204 as sequences of word identifiers. 
For example, referring to FIG. 4, sentence choice 10 is 
represented in memory 204 as a sequence of word iden- 
tifiers 400 uniquely identifying vocabulary entries in the 
stored vocabulary 214. For example, the word "er" 10a 
in sentence choice 10 is represented in memory 204 by 
a word identifier 400a that matches the "WORD ID" field 
of a vocabulary entry 408 in the stored vocabulary 214. 
The "NAME" field in the vocabulary entry 408 is the
string "er," the "PRONUNCIATION" field contains a
pointer to a speech model of the word "er," and the
"CATEGORY TAG" field contains information such as the part
of speech of the vocabulary entry 408, e.g., that it is a
noun.

[0040] Referring to FIG. 5, the compounder process
200 forms compound words from the words 10a-j in the
most likely correct sentence choice 10 of the choice list
220 as follows. The compounder process 200 creates a
category sequence 12 (FIG. 1b) containing a sequence
of categories 12a-j corresponding to the words 10a-j in
the most likely correct sentence choice 10 (step 500).
For example, category 12e (noun) corresponds to word
10e ("Präsident"). Each of the categories 12a-j is derived
from the category tag in the corresponding word's
vocabulary entry in the stored vocabulary 214.
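
The vocabulary lookup and step 500 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `VocabularyEntry`, the function `build_category_sequence`, the numeric word identifiers, and the string standing in for the pronunciation pointer are all hypothetical; only the four field names come from the description of FIG. 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabularyEntry:
    word_id: int        # "WORD ID" field
    name: str           # "NAME" field
    pronunciation: str  # placeholder for the pointer to a speech model
    category_tag: str   # "CATEGORY TAG" field, e.g. the part of speech


def build_category_sequence(word_ids, vocabulary):
    """Step 500: map each word identifier to the category tag of its entry."""
    return [vocabulary[word_id].category_tag for word_id in word_ids]


# Hypothetical toy vocabulary; identifiers and tags are illustrative only.
VOCAB = {
    entry.word_id: entry
    for entry in [
        VocabularyEntry(1, "Wahl", "model:Wahl", "N"),
        VocabularyEntry(2, "Kampf", "model:Kampf", "N"),
        VocabularyEntry(3, "Geschichten", "model:Geschichten", "N"),
        VocabularyEntry(4, "geschrieben", "model:geschrieben", "/"),
    ]
}

sentence_choice = [1, 2, 3, 4]  # a sentence choice stored as word identifiers
category_sequence = build_category_sequence(sentence_choice, VOCAB)
```

Keeping the sentence choice as identifiers and resolving categories on demand mirrors the description above, in which the category of each word is derived from its vocabulary entry rather than stored with the sentence.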
[0041] The compounder process 200 matches the 
category sequence 12 against syntactic templates 224 
which are also stored in memory 204 (step 502). As de- 
scribed in more detail below with respect to FIG. 6, the 
syntactic templates 224 are used to identify words within 
the sentence choice 10 which should not be concatenated
with other words to form compound words, by defining
sequences of word categories which typically do
not result in creation of compound words in German.
[0042] Each syntactic template 224 includes a pair of
phrasal templates drawn from phrasal templates 222,
stored in memory 204. A phrasal template defines a sequence
of word categories. Six phrasal templates used
by the compounder process 200 are shown in Table 1,
below.

Table 1

Phrasal Template Label    Phrase
PH1                       P GAP N
PH2                       N /
PH3                       N V
PH4                       N VV
PH5                       oos GAP N
PH6                       N+



[0043] Within a phrasal template, "P" represents a
preposition, "N" represents a noun, "GAP" represents
any string of one or more words that does not include a
noun or a personal pronoun, "/" represents a past participle,
"V" represents a verb infinitive, "VV" represents
an inflected verb, "oos" represents a subordinate conjunctor,
and "N+" represents one or more nouns. Phrasal
template PH4, for example, represents a phrase consisting
of a noun followed by an inflected verb.
[0044] The set of syntactic templates 224 used by the
compounder 200 is shown in Table 2, below. Syntactic
template R1, for example, consists of the phrasal template
PH1 followed by the phrasal template PH2. The
compounder process 200 uses the syntactic templates
224 shown in Table 2 because, in German, if the categories
of a sequence of words match a sequence of categories
defined by a syntactic template, then words in the
sequence whose categories cross a phrasal template
boundary are typically not concatenated to form a compound
word.

Table 2

Syntactic Template    Phrasal Templates
R1                    PH1 PH2
R2                    PH1 PH3
R3                    PH1 PH4
R4                    PH5 PH2
R5                    PH5 PH3
R6                    PH5 PH4
R7                    PH5 PH6



[0045] Referring now to FIG. 6, the compounder process
200 matches the syntactic templates 224 against
the category sequence 12 as follows. The compounder
process 200 selects a syntactic template (step 600), e.g.,
syntactic template R7 in Table 2. A pointer p is set to
point to the beginning of category sequence 12 (step
601). The compounder process 200 compares the selected
syntactic template to the category sequence 12
beginning at point p (step 602). For example, the compounder
process 200 compares syntactic template R7
(containing the phrasal templates [oos GAP N] and [N+])
to the beginning of category sequence 12. As shown in
FIG. 7a, since the first category in the selected syntactic
template is a subordinate conjunctor and the first category
in category sequence 12 is a noun, the comparison
fails.

[0046] If the comparison fails (decision step 602), 
then the compounder process 200 advances the pointer 
p to the next category in category sequence 12 (step 
607) and compares the selected syntactic template 
against the category sequence 12 beginning at the new
point p (step 602). 

[0047] If the comparison at step 602 succeeds, then
a boundary flag is placed after the category in the category
sequence 12 corresponding to the last word in the
first phrasal template of the selected syntactic template
(step 604). For example, as shown in FIG. 7b, syntactic
template R7 matches the categories of the words "daß
der Präsident Wahl Kampf Geschichten." As a result, a
boundary flag 18 is inserted into category sequence 12
after category 12e (corresponding to "Präsident") and
before category 12f (corresponding to "Wahl"), corresponding
to the boundary between the two phrasal templates
in syntactic template R7. The resulting category
sequence 12 is shown in FIG. 7c.
[0048] The compounder process 200 continues to 
match syntactic templates against the category se- 
quence 12 until all syntactic templates have been com- 
pared with all subsequences of the category sequence 
12. 
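
The matching procedure of FIG. 6 can be sketched as below. This is one possible reading under stated assumptions, not the patented code: the category tags "D" (article) and "PRON" (personal pronoun) are hypothetical, the function names are invented for illustration, and a boundary flag is represented as the index of the category that immediately follows the flag. The phrasal and syntactic templates are copied from Tables 1 and 2.

```python
NOUN, PRONOUN = "N", "PRON"  # "PRON" is a hypothetical tag for personal pronouns

PHRASAL = {
    "PH1": ["P", "GAP", "N"],
    "PH2": ["N", "/"],
    "PH3": ["N", "V"],
    "PH4": ["N", "VV"],
    "PH5": ["oos", "GAP", "N"],
    "PH6": ["N+"],
}
SYNTACTIC = {
    "R1": ("PH1", "PH2"), "R2": ("PH1", "PH3"), "R3": ("PH1", "PH4"),
    "R4": ("PH5", "PH2"), "R5": ("PH5", "PH3"), "R6": ("PH5", "PH4"),
    "R7": ("PH5", "PH6"),
}


def match_ends(elements, cats, pos):
    """Yield each index just past a match of `elements` starting at `pos`."""
    if not elements:
        yield pos
        return
    head, rest = elements[0], elements[1:]
    if head == "GAP":
        # One or more categories, none of which is a noun or personal pronoun.
        end = pos
        while end < len(cats) and cats[end] not in (NOUN, PRONOUN):
            end += 1
            yield from match_ends(rest, cats, end)
    elif head == "N+":
        # One or more consecutive nouns.
        end = pos
        while end < len(cats) and cats[end] == NOUN:
            end += 1
            yield from match_ends(rest, cats, end)
    elif pos < len(cats) and cats[pos] == head:
        yield from match_ends(rest, cats, pos + 1)


def find_boundaries(cats):
    """Return the set of indices i such that a flag sits just before cats[i]."""
    boundaries = set()
    for first, second in SYNTACTIC.values():       # every syntactic template
        for p in range(len(cats)):                 # every start position p
            for mid in match_ends(PHRASAL[first], cats, p):
                if any(True for _ in match_ends(PHRASAL[second], cats, mid)):
                    boundaries.add(mid)            # flag after first phrase
    return boundaries


# Hypothetical tags for "daß der Präsident Wahl Kampf Geschichten".
flags = find_boundaries(["oos", "D", "N", "N", "N", "N"])
```

In this example only template R7 matches, and the single flag falls between the third and fourth categories, that is, after "Präsident" and before "Wahl", as in FIG. 7c.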

[0049] Referring again to FIG. 5, after matching the
syntactic templates against the category sequence 12,
the compounder process 200 applies agreement rules
to the category sequence 12 (step 504). The agreement
rules make use of agreement of case, gender, and
number within the sentence choice 10 to further identify
which words within the sentence choice 10 should not
be concatenated to form compound words.
[0050] A "determiner" is defined as any word that is a
definite or indefinite article, a personal pronoun, a demonstrative
pronoun, or a possessive pronoun. As
shown in FIG. 8, if there are no determiners within the
category sequence 12 (decision step 800), then the
agreement rules are not applicable. Otherwise, the compounder
process 200 identifies the first determiner in the
category sequence 12 (step 802) and identifies the first
noun, if any, in the clause begun by the determiner that
agrees with the determiner in case, number, and gender
(step 804). If such a noun is found (decision step 806),
then: (1) if the noun is in a subordinate clause (decision
step 808), a boundary flag is placed in the category
sequence 12 after the noun (step 810) and after each
word in the noun phrase following the noun (step 812);
(2) if the noun is not in a subordinate clause (decision
step 808), then a boundary flag is placed in the category
sequence 12 after the end of the noun phrase (step 814).
This process is repeated for each determiner in the
category sequence 12. Placement of boundary flags guards
against overgeneration of compound words. More or
fewer boundary flags may be placed within the category
sequence 12 depending on the extent to which generation
of compound words is favored.
[0051] Referring again to FIG. 5, after the compounder
process 200 applies agreement rules to the category
sequence 12, the compounder process applies compounding
rules to the category sequence to determine
which words in the sentence choice 10, if any, should
be concatenated into compound words (step 506). A
compounding rule defines a category sequence. The
compounder process 200 concatenates sequences of
words whose categories match a sequence of categories
defined by a compounding rule, unless there is a
boundary flag within the sequence of words. The compounding
rules used by the compounder process 200
are shown in Table 3.

Table 3

Compounding Rule    Category Sequence
C1                  N N
C2                  N_N N
C3                  P cdz V
C4                  a cdz V
C5                  P //
C6                  P /
C7                  P V
C8                  a N
C9                  a ag
C10                 cff N
C11                 cff CTR
C12                 cff cff
C13                 caf N
C14                 cdd N
C15                 cai N
C16                 cai V
C17                 cai /
C18                 cai //
C19                 cai a
C20                 cai ag
C21                 V L
C22                 E cdz V
C23                 E //
C24                 E /
C25                 E V
C26                 ZA ZA
C27                 ZA cfr ZA
C28                 cgl ag
C29                 cgl //
C30                 cgl /



[0052] As used in Table 3, N_N represents a "new
noun." If the compounder process 200 encounters a
capitalized word that is not in the recognition vocabulary
214, the compounder process 200 assumes that the
word is a noun and assigns the category N_N to it. As
used in Table 3, cdz represents the German preposition
"zu," V represents a verb infinitive, a represents a predicative
adjective, ag represents a conjugated adjective,
cff represents directions (e.g., North and East), CTR
represents a country, state, region, or area, caf represents
any month of the year, cai represents a hyphenated
noun (e.g., a noun beginning with Euro- or Geo-), L represents
a verb infinitive of the German word "lernen" (to
learn), E represents the German word "ein," ZA represents
a number, cfr represents the German word "und,"
and cgl represents words that are prepositions and adverbs
at the same time. The categories used in Table 3
are derived from a larger set of categories that are assigned
to words in the recognition vocabulary 214.
[0053] Referring to FIG. 9, the compounder process
200 concatenates words in the sentence choice 10 into
compound words as follows. The compounder process
200 makes a copy 20 (FIG. 10a) of the sentence choice
10 and stores the copy in memory 204 (step 900). The
compounder process selects the first compounding rule
(step 902) and compares the sequence of categories
defined by the compounding rule to the category sequence
12 associated with the sentence choice 10 (step
904). If the compounding rule matches any subsequence
in the category sequence 12 (decision step
906), then a loop 908a is entered in which, for each
matching subsequence (step 910), the compounder
process 200 creates a compound word by concatenating
the words in the sentence choice copy 20 corresponding
to the subsequence (step 914) if the subsequence
does not contain a boundary flag (decision step
912). The resulting compound word is queued for submission
to a compound word cache 216 (step 915),
described in more detail with respect to FIG. 12, below.
The compounder applies the remaining compounding
rules to the category sequence 12 (steps 902-919).



[0054] For example, compounding rule C1 (N N)
matches the words "Wahl" 20f and "Kampf" 20g in sentence
choice copy 20, so the words 20f and 20g are
compounded, resulting in the sentence choice copy 20
shown in FIG. 10b. Compounding rule C2 (N_N N)
matches the compound word "Wahlkampf" 20k and the
word "Geschichten" 20h, so the words 20k and 20h are
compounded, resulting in the sentence choice copy 20
shown in FIG. 10c.
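
The loop of FIG. 9 can be sketched as follows, under several assumptions that the text does not state: the function name `apply_compounding_rules` is hypothetical, only rules C1 and C2 of Table 3 are included, constituent words after the first are lower-cased when joined, and a newly formed capitalized compound is given the category N_N (consistent with rule C2 matching "Wahlkampf" in the example above). A boundary flag is represented as the index of the word that immediately follows the flag.

```python
def apply_compounding_rules(words, cats, boundaries, rules):
    """Concatenate words whose categories match a rule and span no flag."""
    words, cats, boundaries = list(words), list(cats), set(boundaries)
    created = []  # compounds queued for the compound word cache (step 915)
    for rule in rules:                                   # steps 902-919
        n = len(rule)
        i = 0
        while i + n <= len(cats):
            # A flag at index b separates word b-1 from word b.
            flagged = any(b in boundaries for b in range(i + 1, i + n))
            if cats[i:i + n] == rule and not flagged:    # steps 906, 912
                compound = words[i] + "".join(w.lower() for w in words[i + 1:i + n])
                words[i:i + n] = [compound]              # step 914
                # Assumption: a capitalized compound is treated as a new noun.
                cats[i:i + n] = ["N_N" if compound[0].isupper() else rule[-1]]
                # Shift flags that lie to the right of the merged words.
                boundaries = {b - (n - 1) if b >= i + n else b for b in boundaries}
                created.append(compound)
            i += 1
    return words, created


RULES = [["N", "N"], ["N_N", "N"]]  # C1 and C2 from Table 3

# A flag before index 1 keeps "Präsident" apart from "Wahl".
result, queued = apply_compounding_rules(
    ["Präsident", "Wahl", "Kampf", "Geschichten"],
    ["N", "N", "N", "N"],
    {1},
    RULES,
)
```

With the flag in place, C1 joins "Wahl" and "Kampf" and C2 then joins the result with "Geschichten", leaving "Präsident" untouched; without the flag, C1 would instead begin by joining "Präsident" and "Wahl".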

[0055] If no compound words were created during application
of the compounding rules (decision step 918),
then application of the compounding rules is complete.
Otherwise, the sentence choice copy is added to the top
of the choice list 220 (step 920). The choice list 220 resulting
from application of the compounding rules to the
sentence choice 10 is shown in FIG. 11. The compound
words are then added to a compound word cache 216
and to the recognition vocabulary 214 (step 922). Adding
the compound words to the recognition vocabulary
214 allows the continuous speech recognizer 212 to directly
recognize future occurrences of such words without
the aid of the compounder process 200.
[0056] The compound word cache 216 contains compound
words which have previously been created by the
compounder process 200. Associated with each compound
word in the compound word cache 216 is a frequency
corresponding to the number of times that the
compound word has been recognized. Referring to FIG.
12, compound words that have been queued for submission
to the compound word cache 216 are added to
the compound word cache 216 and to the recognition
vocabulary 214 as follows. The compounder process
200 selects a compound word from the set of compound
words (step 1000). If the selected compound word is already
in the compound word cache (decision step
1002), then the frequency of the selected compound
word is incremented (step 1004).
[0057] If the selected compound word is not in the
compound word cache (decision step 1002), then the
selected compound word is added to the compound
word cache 216 (step 1008) if the compound word
cache 216 is not full (decision step 1006). If the compound
word cache 216 is full (decision step 1006), then
the oldest compound word in the compound word cache
216 is deleted from the compound word cache 216 and
from the recognition vocabulary 214 (step 1012). If the
deleted compound word is frequently used (e.g., if its
frequency is greater than a predetermined threshold frequency)
(decision step 1014), then the deleted compound
word is added to the compound word cache and
the recognition vocabulary 214 with a new timestamp
corresponding to the current time (step 1016). Steps
1012-1016 are repeated as necessary until the compound
word that is deleted is not a frequently used compound
word. The selected compound word is then added to
the compound word cache 216 and to the recognition
vocabulary 214 (step 1008).

[0058] If there are more compound words in the
queue (decision step 1018), then the next compound
word is selected from the queue (step 1020), and steps
1002-1016 are repeated. Otherwise, addition of compound
words is complete (step 1022).
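
The cache update of FIG. 12 might be sketched as shown here. The class name `CompoundWordCache`, the use of insertion order in place of explicit timestamps, the initial frequency of 1, and the bound on the eviction loop are assumptions made for illustration; the description does not say what happens when every cached word is frequently used, so this sketch simply evicts the oldest entry in that case. Updates to the recognition vocabulary 214 are omitted.

```python
from collections import OrderedDict


class CompoundWordCache:
    """Frequency-counting cache; entries are kept oldest first."""

    def __init__(self, capacity, threshold):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.threshold = threshold      # the "predetermined threshold frequency"
        self.entries = OrderedDict()    # word -> frequency

    def add(self, word):
        if word in self.entries:                       # decision step 1002
            self.entries[word] += 1                    # step 1004
            return
        if len(self.entries) >= self.capacity:         # decision step 1006
            # Assumption: try each entry at most once so the loop terminates.
            for _ in range(len(self.entries)):
                oldest, freq = self.entries.popitem(last=False)  # step 1012
                if freq <= self.threshold:             # decision step 1014
                    break
                self.entries[oldest] = freq            # step 1016: newest again
            else:
                # Every entry was frequently used; evict the oldest anyway.
                self.entries.popitem(last=False)
        self.entries[word] = 1                         # step 1008


cache = CompoundWordCache(capacity=2, threshold=1)
for compound in ["Wahlkampf", "Wahlkampf", "Hausboot", "Wahlkampfgeschichten"]:
    cache.add(compound)
```

In this run "Wahlkampf" has frequency 2 and survives with a fresh position, while "Hausboot" (a made-up example word with frequency 1) is evicted to make room for the new compound.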
[0059] The compounder process 200 may create incorrect
compound words. In such cases the user may
replace the incorrect compound word with a replacement
word. Referring to FIG. 13, when a user replaces
an incorrect compound word with a replacement word,
the compounder process 200 removes the incorrect
compound word from the compound word cache 216
and from the recognition vocabulary 214 (step 1050).
Compound words which have been identified by the user
as incorrect are stored in a compound word error
cache 218. Associated with each compound word in the
compound word error cache is a frequency indicating
the number of times that the user has identified the compound
word as being incorrect. If the incorrect compound
word is not in the compound word error cache
218 (decision step 1052), then the incorrect compound
word is added to the compound word error cache (step
1054). Otherwise, the frequency of the incorrect compound
word in the compound word error cache 218 is
incremented (step 1056).

[0060] The compounder process 200 can use the 
compound word error cache 218 to improve recognition 
of compound words by not generating compound words 
that were previously identified as incorrect. For example,
referring to FIG. 14, a loop 908b may be used in
place of the loop 908a (FIG. 9) during compound word
recognition. For each subsequence of words matching 
a compound rule (step 910), if the subsequence does 
not contain a boundary flag (decision step 912), a can- 
didate compound word is generated by concatenating 
the sequence of matching words (step 924). If the candidate
compound word is not in the compound word error cache
(decision step 926), then a compound word is created by
concatenating the matched words (step 914). If the candidate
is in the compound word error cache but not in the compound
word cache (decision step 928), no compound word is
created. If the candidate compound
word is in both the compound word error cache (decision 
step 926) and the compound word cache (decision step
928), then a compound word is created from the
matched words (step 914) only if the frequency of the 
candidate word in the compound word cache is greater 
than the frequency of the compound word in the com- 
pound word error cache (decision step 930). 
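
One reading of FIGS. 13 and 14 is sketched below. It assumes that a candidate absent from the error cache is always formed and that a candidate found only in the error cache is suppressed, which is consistent with the stated goal of not regenerating compounds the user has rejected but is an interpretation rather than a statement of the method. The function names are hypothetical and both caches are modeled as plain dictionaries mapping a word to its frequency.

```python
def record_error(word, word_cache, error_cache):
    """FIG. 13: the user replaced `word`, so treat it as incorrect."""
    word_cache.pop(word, None)                          # step 1050
    error_cache[word] = error_cache.get(word, 0) + 1    # steps 1054 and 1056


def should_form_compound(candidate, word_cache, error_cache):
    """FIG. 14: decide whether a candidate compound should be created."""
    if candidate not in error_cache:                    # decision step 926
        return True
    if candidate not in word_cache:                     # decision step 928
        return False
    # In both caches: form it only if accepted more often than rejected.
    return word_cache[candidate] > error_cache[candidate]   # decision step 930


word_cache = {"Wahlkampf": 3}
error_cache = {}
before = should_form_compound("Wahlkampf", word_cache, error_cache)
record_error("Wahlkampf", word_cache, error_cache)
after = should_form_compound("Wahlkampf", word_cache, error_cache)
```

After a single correction the compound is no longer formed, because the correction also removes it from the compound word cache; it would be formed again only if it re-entered that cache with a frequency higher than its error count.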
[0061] Although elements of the invention are de- 
scribed in terms of a software implementation, the in- 
vention may be implemented in software or hardware or 
firmware, or a combination of the three. 



Claims 

1. In a system for recognizing speech in a language,
a computer-implemented method for improving recognition
of a text string, the text string comprising
words associated with parts of speech, the method
comprising:

analyzing the text string with respect to information
about expected patterns of the parts of
speech in the language; and

modifying the text string based on the analysis.

2. The method of Claim 1, wherein the information
comprises rules descriptive of combinations of
parts of speech in the language corresponding to
compound words in the language.

3. The method of Claim 2, wherein combinations comprise
sequences.

4. The method of Claim 2 or 3, wherein the analyzing
step comprises:

comparing the combinations of parts of speech
to parts of speech associated with the words in
the text string; and

if at least one of the combinations of parts of
speech matches parts of speech associated
with the words, indicating that a compound
word should be formed from the words associated
with the matched parts of speech.

5. The method of Claim 4, further comprising:

analyzing the text string with respect to rules
descriptive of unpreferred combinations of
parts of speech in the language corresponding
to combinations of words which do not typically
form compound words in the language; and

if at least one of the unpreferred combinations
of parts of speech matches parts of speech associated
with the words, indicating that a compound
word should not be formed from the
words associated with the matched parts of
speech.

6. The method of Claim 5, wherein the unpreferred
combinations of parts of speech correspond to combinations
of groups of parts of speech, the groups
corresponding to phrases.

7. The method of Claim 6, wherein groups comprise
pairs.

8. The method of Claim 5, further comprising:

analyzing the text string with respect to agreement
rules descriptive of patterns of agreement
of case, number, and gender of words corresponding
to combinations of words which do
not typically form compound words in the language; and

if at least one of the agreement rules matches
words in the text string, indicating that a compound
word should not be formed from the
matching words.

9. The method of Claim 8, wherein the agreement
rules include a rule indicating that if a noun in a subordinate
clause matches the case, number, and
gender of a preceding determiner, a compound
word should not be formed from the noun and subsequent
words in the subordinate clause.

10. The method of Claim 8 or 9, wherein the agreement
rules include a rule indicating that if a noun in a non-subordinate
clause matches the case, number, and
gender of a preceding determiner, a compound
word should not be formed from words in the noun
phrase containing the noun and words subsequent
to the noun phrase.

11. The method of any one of the preceding claims,
wherein modifying the text string comprises forming
a compound word from words in the text string.

12. The method of Claim 11, further comprising adding
the compound word to a vocabulary.

13. The method of any one of the preceding claims, further
comprising:

adding the compound word to a compound
word cache.

14. The method of Claim 13, wherein adding the compound
word comprises increasing the frequency
count of the compound word in the compound word
cache.

15. The method of any one of Claims 1 to 10, wherein
modifying the text string comprises replacing words
in the text string with the compound word.

16. The method of Claim 15, further comprising:

adding the modified text string to a list of candidate
text strings.

17. The method of any one of Claims 4 to 16, further
comprising:

identifying the compound word as an incorrect
compound word; and

adding the compound word to a compound
word error cache.

18. The method of Claim 17, wherein adding the compound
word to the compound word error cache
comprises increasing a frequency of the compound
word in the compound word error cache.

19. The method of any one of Claims 4 to 18, further
comprising:

if the compound word has been identified as
an incorrect compound word, indicating that the
compound word should not be formed from the
words associated with the matched parts of speech.

20. The method of Claim 19, wherein the compound
word has been identified as an incorrect compound
word in response to action of a user by adding the
compound word to a compound word error cache.

21. The method of any one of Claims 4 to 20, further
comprising:

indicating that the compound word should not
be formed from the words associated with the
matched parts of speech if the compound word has
been identified as an incorrect compound word
more frequently than the compound word has not
been identified to be an incorrect compound word.

22. The method of any one of the preceding claims,
wherein the language comprises German.